Project Overview

We first discuss the project motivation and our experiment, as Rida asked, and then begin our analysis.

Project Setup

Business Understanding

  • [10 points] Describe the purpose of the data set you selected (i.e., why was this data collected in the first place?). How will you measure the effectiveness of a good algorithm? Why does your chosen validation method make sense for this specific dataset and the stakeholders' needs?

We chose this dataset from the UCI Machine Learning Repository for its categorical predictive attributes. It contains 1994 Census data pulled from the US Census database. The prediction task we have set forth is to predict whether a person's income exceeds $50K in 1994, based on the various categorical/numerical attributes in the census database. The link to the data source is below:

https://archive.ics.uci.edu/ml/datasets/census+income

The effectiveness of our algorithm will be measured by support, confidence, and lift, the standard metrics that describe how strong the relationship is between the elements within each transaction. There is currently no established method for cross-validating association rules, although efforts to build such tools are ongoing.
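For reference, a minimal sketch of these standard measures (the helper functions here are purely illustrative and our own; arules computes these quantities for us later):

```r
# For a rule X => Y mined from N transactions:
#   support    = fraction of transactions containing both X and Y
#   confidence = fraction of X-transactions that also contain Y
#   lift       = confidence relative to the baseline frequency of Y
support    <- function(n_xy, n) n_xy / n
confidence <- function(n_xy, n_x) n_xy / n_x
lift       <- function(n_xy, n_x, n_y, n) (n_xy / n_x) / (n_y / n)

# Example: of 1000 transactions, 500 contain X, 400 contain Y,
# and 300 contain both:
support(300, 1000)          # 0.3
confidence(300, 500)        # 0.6
lift(300, 500, 400, 1000)   # 0.6 / 0.4 = 1.5
```

A lift above 1 means X and Y co-occur more often than they would if they were independent.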

Data Understanding

  • [10 points] Describe the meaning and type of data (scale, values, etc.) for each attribute in the data file. Verify data quality: Are there missing values? Duplicate data? Outliers? Are those mistakes? How do you deal with these problems?
  • [10 points] Visualize any important attributes appropriately. Important: Provide an interpretation for any charts or graphs.

Here we will discuss each attribute and give some description about its ranges.

Categorical Attributes

  • workclass - Which business sector do they work in?
  • education - What level of education received?
  • marital_status - What is their marital history?
  • occupation - What do they do for a living?
  • relationship - Family member relation (e.g., Husband, Own-child)
  • race - What is the subject's race?
  • gender - What is the subject's gender?
  • native_country - Where is the subject originally from?
  • income_bracket - Do they make over or under $50K/year?

Continuous Attributes

  • age - How old is the subject?
  • fnlwgt - Sampling weight of observation
  • education_num - numerical encoding of education variable
  • capital_gain - income from investment sources, separate from wages/salary
  • capital_loss - losses from investment sources, separate from wages/salary
  • hours_per_week - How many hours a week did they work?

Setup

First, let's load the necessary libraries:

Loaded Packages:
   arules, arulesViz, forcats, dplyr, plotly, data.table,
   pander, knitr, skimr, lubridate, ggplot2, cowplot,
   foreach, doParallel

Next, let's import our dataset:

data <- read.csv("./data/adult-training.csv")

Data Quality Inspection

The first thing we must do is check whether there are any NAs in our dataset, so that they do not quietly distort our analysis.

NA_sum <- sort(sapply(data, function(x) sum(is.na(x))), decreasing = TRUE)
data.frame(NA_sum)
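No NAs turn up, but note that this dataset encodes missing values as the string "?" rather than as NA (the "?" levels are visible in the factor listings below). A complementary check along these lines may be worthwhile:

```r
# Count "?" entries per column; trimws handles the leading whitespace
# still present in the raw CSV at this point
q_counts <- sapply(data, function(x) sum(trimws(as.character(x)) == "?"))
sort(q_counts[q_counts > 0], decreasing = TRUE)
```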

Data Cleaning

It looks like we are doing OK here. The next issue is that, because of the way the CSV was stored, some of the levels in our factors include leading and trailing whitespace. This is highly undesirable, so we must clean it up:

GetFactors <- function(df) {
    # Return the names of the factor columns in a data frame
    return(names(Filter(is.factor, df)))
}
FixLevels <- function(x) {
    # Strip leading/trailing whitespace from a factor's levels
    levels(x) <- trimws(levels(x))
    return(x)
}
data[GetFactors(data)] <- lapply(data[GetFactors(data)], FixLevels)
pander(lapply(data[GetFactors(data)], levels))
  • workclass: ?, Federal-gov, Local-gov, Never-worked, Private, Self-emp-inc, Self-emp-not-inc, State-gov and Without-pay
  • education: 10th, 11th, 12th, 1st-4th, 5th-6th, 7th-8th, 9th, Assoc-acdm, Assoc-voc, Bachelors, Doctorate, HS-grad, Masters, Preschool, Prof-school and Some-college
  • marital_status: Divorced, Married-AF-spouse, Married-civ-spouse, Married-spouse-absent, Never-married, Separated and Widowed
  • occupation: ?, Adm-clerical, Armed-Forces, Craft-repair, Exec-managerial, Farming-fishing, Handlers-cleaners, Machine-op-inspct, Other-service, Priv-house-serv, Prof-specialty, Protective-serv, Sales, Tech-support and Transport-moving
  • relationship: Husband, Not-in-family, Other-relative, Own-child, Unmarried and Wife
  • race: Amer-Indian-Eskimo, Asian-Pac-Islander, Black, Other and White
  • gender: Female and Male
  • native_country: ?, Cambodia, Canada, China, Columbia, Cuba, Dominican-Republic, Ecuador, El-Salvador, England, France, Germany, Greece, Guatemala, Haiti, Holand-Netherlands, Honduras, Hong, Hungary, India, Iran, Ireland, Italy, Jamaica, Japan, Laos, Mexico, Nicaragua, Outlying-US(Guam-USVI-etc), Peru, Philippines, Poland, Portugal, Puerto-Rico, Scotland, South, Taiwan, Thailand, Trinadad&Tobago, United-States, Vietnam and Yugoslavia
  • income_bracket: <=50K and >50K

Next, we need to re-encode some of our factors. First, let's collapse the education levels into larger groups (for example, grades 1st through 12th should be a single No Diploma level, not a bunch of separate levels).

data$education <- fct_collapse(data$education, `No Diploma` = c("1st-4th", "5th-6th", 
    "7th-8th", "9th", "10th", "11th", "12th", "Preschool"), Associates = c("Assoc-acdm", 
    "Assoc-voc"), Diploma = c("Some-college", "HS-grad"))

Then the income brackets:

data$income_bracket <- fct_collapse(data$income_bracket, small = "<=50K", large = ">50K")

Next, let's recode the ? levels to something more useful:

levels(data$workclass)[levels(data$workclass) == "?"] <- "Other"
levels(data$occupation)[levels(data$occupation) == "?"] <- "Other-service"
levels(data$native_country)[levels(data$native_country) == "?"] <- "Other"
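Since forcats is already loaded, the same recoding could be sketched with fct_recode (new name on the left, old level on the right); assigning an existing level name merges the two levels:

```r
# Equivalent recoding via forcats
data$workclass      <- fct_recode(data$workclass, Other = "?")
data$occupation     <- fct_recode(data$occupation, `Other-service` = "?")
data$native_country <- fct_recode(data$native_country, Other = "?")
```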

Next, let's remove the fnlwgt, education_num, capital_gain, and capital_loss columns, as they are not needed for rule mining. We then use the cut function to discretize the remaining numeric variables into factors:

data <- data[, -c(3, 5, 11:12)]  # fnlwgt, education_num, capital_gain, capital_loss
data$age <- cut(data$age, breaks = c(15, 25, 45, 65, 100), labels = c("Young", 
    "Middleaged", "Senior", "Retired"))
data$hours_per_week <- cut(data$hours_per_week, breaks = c(0, 20, 40, 60, 80), 
    labels = c("part-time", "full-time", "hard-working", "need-a-life"))
str(data)
#> 'data.frame':    32561 obs. of  11 variables:
#>  $ age           : Factor w/ 4 levels "Young","Middleaged",..: 2 3 2 3 2 2 3 3 2 2 ...
#>  $ workclass     : Factor w/ 9 levels "Other","Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...
#>  $ education     : Factor w/ 7 levels "No Diploma","Associates",..: 3 3 5 1 3 6 1 5 6 3 ...
#>  $ marital_status: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
#>  $ occupation    : Factor w/ 14 levels "Other-service",..: 2 5 7 7 10 5 1 5 10 5 ...
#>  $ relationship  : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
#>  $ race          : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
#>  $ gender        : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
#>  $ hours_per_week: Factor w/ 4 levels "part-time","full-time",..: 2 1 2 2 2 2 1 3 3 2 ...
#>  $ native_country: Factor w/ 42 levels "Other","Cambodia",..: 40 40 40 40 6 40 24 40 40 40 ...
#>  $ income_bracket: Factor w/ 2 levels "small","large": 1 1 1 1 1 1 1 2 2 2 ...

Let's see the results:

levels(data$workclass)
#> [1] "Other"            "Federal-gov"      "Local-gov"       
#> [4] "Never-worked"     "Private"          "Self-emp-inc"    
#> [7] "Self-emp-not-inc" "State-gov"        "Without-pay"
pander(summary(data))
age               workclass                 education
Young     : 6411  Private         :22696    No Diploma : 4253
Middleaged:16523  Self-emp-not-inc: 2541    Associates : 2449
Senior    : 8469  Local-gov       : 2093    Bachelors  : 5355
Retired   : 1158  Other           : 1836    Doctorate  :  413
                  State-gov       : 1298    Diploma    :17792
                  Self-emp-inc    : 1116    Masters    : 1723
                  (Other)         :  981    Prof-school:  576

marital_status               occupation            relationship
Divorced             : 4443  Other-service  :5138  Husband       :13193
Married-AF-spouse    :   23  Prof-specialty :4140  Not-in-family : 8305
Married-civ-spouse   :14976  Craft-repair   :4099  Other-relative:  981
Married-spouse-absent:  418  Exec-managerial:4066  Own-child     : 5068
Never-married        :10683  Adm-clerical   :3770  Unmarried     : 3446
Separated            : 1025  Sales          :3650  Wife          : 1568
Widowed              :  993  (Other)        :7698

race                      gender        hours_per_week
Amer-Indian-Eskimo:  311  Female:10771  part-time   : 2928
Asian-Pac-Islander: 1039  Male  :21790  full-time   :20052
Black             : 3124                hard-working: 8471
Other             :  271                need-a-life :  902
White             :27816                NA's        :  208

native_country         income_bracket
United-States:29170    small:24720
Mexico       :  643    large: 7841
Other        :  583
Philippines  :  198
Germany      :  137
Canada       :  121
(Other)      : 1709

EDA

We’d also like to get a quick feel for the dataset through some visualizations.

p1 <- ggplot(data, aes(x = age, color = income_bracket, fill = income_bracket)) + 
    geom_density(alpha = 0.9) + labs(x = "Age", y = "Density", title = "Age Density by Income", 
    subtitle = "Density plot")
p2 <- ggplot(data, aes(x = education, fill = income_bracket, color = income_bracket)) + 
    geom_bar(alpha = 0.9, position = "fill") + coord_flip() + labs(x = "Education", 
    y = "Proportion", title = "Income bias based on Education", subtitle = "Stacked bar plot")
p3 <- ggplot(data, aes(x = marital_status, fill = income_bracket, color = income_bracket)) + 
    geom_bar(alpha = 0.9, position = "fill") + coord_flip() + labs(x = "Marital Status", 
    y = "Proportion", title = "Income bias based on Marital status", subtitle = "Stacked bar plot")
p4 <- ggplot(data, aes(x = occupation, fill = income_bracket, color = income_bracket)) + 
    geom_bar(alpha = 0.9, position = "fill") + coord_flip() + labs(x = "Occupation Status", 
    y = "Proportion", title = "Income bias based on Occupation status", subtitle = "Stacked bar plot")
p5 <- ggplot(data, aes(x = hours_per_week, fill = income_bracket, color = income_bracket)) + 
    geom_bar(alpha = 0.9, position = "dodge") + labs(x = "Hours per week", 
    y = "Count", title = "Hours per week by Income", subtitle = "Bar plot")
p6 <- ggplot(data, aes(occupation)) + geom_bar(aes(fill = education), width = 0.5) + 
    theme(axis.text.x = element_text(angle = 60, vjust = 0.5)) + labs(title = "Bar chart of occupation with education binning", 
    subtitle = "Occupation and Education")
p1

p2

p3

p4

p5

p6

TODO: Discuss the graphs above.

Finally, we can set up our dataset to be the proper data format for the Apriori algorithm:

data <- as(data, "transactions")
summary(data)
#> transactions as itemMatrix in sparse format with
#>  32561 rows (elements/itemsets/transactions) and
#>  102 columns (items) and a density of 0.1077805 
#> 
#> most frequent items:
#> native_country=United-States                   race=White 
#>                        29170                        27816 
#>         income_bracket=small            workclass=Private 
#>                        24720                        22696 
#>                  gender=Male                      (Other) 
#>                        21790                       231771 
#> 
#> element (itemset/transaction) length distribution:
#> sizes
#>    10    11 
#>   208 32353 
#> 
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   10.00   11.00   11.00   10.99   11.00   11.00 
#> 
#> includes extended item information - examples:
#>           labels variables     levels
#> 1      age=Young       age      Young
#> 2 age=Middleaged       age Middleaged
#> 3     age=Senior       age     Senior
#> 
#> includes extended transaction information - examples:
#>   transactionID
#> 1             1
#> 2             2
#> 3             3
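Each row of the original data frame is now a transaction whose items are variable=level pairs. As a quick sanity check on the coercion, the first few transactions can be inspected directly:

```r
# Print the item sets of the first three transactions
inspect(data[1:3])
```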

Modeling and Evaluation

  • Different tasks will require different evaluation methods. Be as thorough as possible when analyzing the data you have chosen and use visualizations of the results to explain the performance and expected outcomes whenever possible. Guide the reader through your analysis with plenty of discussion of the results. For this task, we chose Option B: Association Rule Mining.

Option B: Association Rule Mining

  • Create frequent itemsets and association rules.
  • Use tables/visualization to discuss the found results.
  • Use several measures for evaluating how interesting different rules are.
  • Describe your results. What findings are the most compelling and why?

Before we begin with our analysis, let's check the item frequencies within the dataset. We plot only items with support >= 0.2:

itemFrequencyPlot(data, support = 0.2)

Rule mining

Next, let's mine rules with the Apriori algorithm and then prune redundant rules. We restrict the right-hand side to the income bracket and start with a minimum support of 0.2 and a minimum confidence of 0.15:

zerules <- apriori(data, parameter = list(minlen = 2, supp = 0.2, conf = 0.15), 
    appearance = list(rhs = c("income_bracket=small", "income_bracket=large"), 
        default = "lhs"), control = list(verbose = F))
length(zerules)
#> [1] 91
redundant <- is.redundant(zerules)
zerules.pruned <- zerules[redundant == FALSE]
rulesorted <- sort(zerules.pruned, by = "lift", decreasing = TRUE)
length(rulesorted)
#> [1] 25
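Before plotting, it is worth eyeballing the strongest rules directly; for example, the top five by lift:

```r
# The rules are already sorted by lift, so take the first five
inspect(rulesorted[1:5])
```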

Rule quality and inspection

Next, let's inspect the rules and examine their quality measures:

quality(rulesorted)
inspectDT(rulesorted)

Plots

First, let's view a scatter plot of our rules:

plot(rulesorted, method = "scatterplot", measure = c("confidence", "support"), 
    shading = "lift", engine = "htmlwidget")

Next, let's look at a graph-based visualization of the rules:

plot(rulesorted, method = "graph", measure = "confidence", shading = "lift", 
    engine = "htmlwidget")

Parallel plot

plot(rulesorted, method = "paracoord", measure = "confidence", shading = "lift", 
    control = list(reorder = T))

Two-key plot

plot(rulesorted, method = "two-key plot", measure = "confidence", shading = "lift", 
    engine = "htmlwidget")

Grouped plot

plot(rulesorted, method = "grouped", measure = "confidence", shading = "lift")

Alternate rule mining

As a comparison, let's mine a second rule set with lower support (0.1) but much higher confidence (0.9):

rule2 <- apriori(data, parameter = list(minlen = 2, supp = 0.1, conf = 0.9), 
    appearance = list(rhs = c("income_bracket=small", "income_bracket=large"), 
        default = "lhs"), control = list(verbose = F))
length(rule2)
#> [1] 128
redundant <- is.redundant(rule2)
rulep <- rule2[redundant == FALSE]
rulesorted2 <- sort(rulep, by = "lift", decreasing = TRUE)
length(rulesorted2)
#> [1] 56

Inspection

head(quality(rulesorted2))
inspectDT(rulesorted2)
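Beyond support, confidence, and lift, arules can compute additional interestingness measures via interestMeasure(); a sketch using a few documented measure names:

```r
# Chi-squared, conviction, and cosine for the pruned rule set,
# computed against the transaction data
extra <- interestMeasure(rulesorted2,
    measure = c("chiSquared", "conviction", "cosine"),
    transactions = data)
head(cbind(quality(rulesorted2), extra))
```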

Plotting

plot(rulesorted2, method = "scatterplot", measure = c("confidence", "support"), 
    shading = "lift", engine = "htmlwidget")
plot(rulesorted2, method = "graph", measure = "confidence", shading = "lift", 
    engine = "htmlwidget")
plot(rulesorted2, method = "two-key plot", measure = "confidence", shading = "lift", 
    engine = "htmlwidget")
plot(rulesorted2, method = "grouped", measure = "confidence", shading = "lift")

Deployment

Be critical of your performance and tell the reader how your current model might be usable by other parties. Did you achieve your goals? If not, can you rein in the utility of your modeling?

  • How useful is your model for interested parties (i.e., the companies or organizations that might want to use it)?
  • How would you deploy your model for interested parties?
  • What other data should be collected?
  • How often would the model need to be updated, etc.?

Exceptional Work